To be delivered by 2023/01/06 23:59:59.
You will start by setting up a series of connections in order to extract some data with the Arduino. First, make the connections as shown below. Mind the orientation of the temperature sensor: if you place it incorrectly, you will be connecting the power to the ground and vice versa, and you will damage the sensor. The photoresistor, on the other hand, has no polarity.

In this problem, you will read temperature and luminance from the sensors and print them to the serial port.
1) Code an Arduino sketch where the values of temperature and luminance are printed to the serial port. For each serial print, print the value of temperature, then a semicolon, then the value of luminance, followed by a newline (no whitespace). You can do this with three separate Serial.print calls, the last one being a Serial.println. Print the values 5 times per second (use the delay function to control this). Manually influence the readings of the sensors by covering the photoresistor or shining light on it, and by lightly and carefully touching the temperature sensor to increase its temperature readings.
Note that the temperature sensor appears not to be very reliable. Since the objective of this exercise is just to plot the results, this should not be an issue.
Copy and paste your Arduino code below. You may use a Python code cell, even though the code cannot be run.
Hint: for the temperature value to be in degrees Celsius, divide the analog reading by 1024 and multiply it by 500. The luminance does not have to be converted.
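As a quick sanity check of the hinted conversion (plain Python, not Arduino; the raw reading of 66 is just an illustrative value):

```python
# Sanity check of the hinted conversion: Celsius = raw * 500 / 1024,
# where raw is a 10-bit ADC reading in the range 0-1023.
def to_celsius(raw):
    """Convert a 10-bit ADC reading to degrees Celsius."""
    return raw * 500 / 1024

print(round(to_celsius(66), 2))  # 32.23
```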
int PhotoResistorSensor = A0;
int TemperatureSensor = A1;
int TempInput;      // Store the raw temperature input
double Temperature; // Store temperature [degrees Celsius]
int LuminInput;     // Store the raw photoresistor input
double Luminance;   // Store luminance

void setup() {
  Serial.begin(9600);
}

void loop() {
  TempInput = analogRead(TemperatureSensor);
  Temperature = (double)TempInput * 500 / 1024; // Convert reading to Celsius
  LuminInput = analogRead(PhotoResistorSensor);
  Luminance = (double)LuminInput; // Luminance needs no conversion
  Serial.print(Temperature);
  Serial.print(';');
  Serial.println(Luminance);
  delay(200); // 5 readings per second
}
To import the data into Python, keep the sketch running (the Serial Monitor must be closed in the Arduino IDE) and run the following code, changing the COM port to your own. This block of code reads 500 values from the serial port. Given that one observation is taken every 0.2 seconds, it should take about a minute and a half.
import serial
import time

ser = serial.Serial('COM4', 9600, timeout=1)
time.sleep(2)
data = []
for i in range(500):
    line = ser.readline()
    if line:
        string = line.decode()
        data.append(string)
ser.close()
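The pandas cell below reads the observations from a file named lab5_data.txt, so the collected lines need to be written to disk first. A minimal sketch (the two sample lines are illustrative; the real ones come from the serial loop above):

```python
# Write the raw serial lines (already formatted as "temp;lum\n")
# to the text file that the pandas step reads back in.
data = ['32.23;502.0\n', '41.50;503.0\n']  # illustrative lines; omit when using real data
with open('lab5_data.txt', 'w') as f:
    f.writelines(data)
```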
Convert the data into a pandas dataframe and save it to a CSV file. Besides the values of temperature and luminance, also include the time, taking the first observation at $t=0$ and each subsequent observation 0.2 seconds after the previous one. The file must be submitted on Fenix and included in your GitHub repo.
import pandas as pd
import numpy as np

# Create dataframe and name columns (the file has no header row)
ardu = pd.read_csv('lab5_data.txt', sep=';', header=None,
                   names=['Temperature', 'Luminance'])
# Create time and add it to the dataframe (one observation every 0.2 s)
ardu['Time'] = np.arange(len(ardu)) * 0.2
# Save to CSV for submission
ardu.to_csv('lab5_data.csv', index=False)
ardu.head()
|   | Temperature | Luminance | Time |
|---|---|---|---|
| 0 | 32.23 | 502.0 | 0.0 |
| 1 | 41.50 | 503.0 | 0.2 |
| 2 | 16.60 | 504.0 | 0.4 |
| 3 | 41.50 | 504.0 | 0.6 |
| 4 | 30.76 | 504.0 | 0.8 |
Plot the Temperature against time, the luminance against time and the temperature against the luminance.
import matplotlib.pyplot as plt

# Define figure
plt.rcParams['figure.figsize'] = [15, 9]
fig, plotz = plt.subplots(3)
fig.tight_layout(pad=4.0)
# Plot Luminance vs. Time, Temperature vs. Time and Temperature vs. Luminance
plotz[0].plot(ardu['Time'], ardu['Luminance'])
plotz[0].set(xlabel='Time', ylabel='Luminance', title='Luminance over time')
plotz[1].plot(ardu['Time'], ardu['Temperature'], color='orange')
plotz[1].set(xlabel='Time', ylabel='Temperature', title='Temperature over time')
plotz[2].scatter(ardu['Luminance'], ardu['Temperature'], color='green')
plotz[2].set(xlabel='Luminance', ylabel='Temperature', title='Temperature over Luminance')
In the luminance plot, the values remain constant for the first twenty seconds, while the photoresistor is exposed to the ambient light without any disturbance. Then a flashlight is pointed at the sensor, hence the rise in value. Five seconds later the sensor is covered and the luminance level drops. The flashlight is pointed at the sensor again and the value rises once more. For a brief moment the sensor is only exposed to the ambient light, and finally it is covered again.
In the temperature plot the values change constantly, since the sensor is not very reliable. It is also exposed to ambient air, so several disturbances may affect the readings. Only when it is covered (for a few seconds starting around the 25-second mark and again around the 33-second mark) does it show stable values.
In the third plot, a scatter plot, the different situations can be identified as is represented in the following image.
For the databases part of this assignment, you will use the MIMIC-III database from the laboratory session. Start by adding a few new tables to the database, using the SQL files included in the assignment's files. Open pgAdmin and connect to your mimic-iii database. To properly load these tables, run the following files exactly and in the order presented.
1) Run demographic.sql
2) Run lab_firstday.sql
You will now have to answer a few SQL questions.
1. Open the connection to your mimic-iii database. If you want, you can delete your credentials before submitting the assignment, but if you do so, please run the notebook first, for the results to be displayed.
import psycopg2 as psql

conn = psql.connect(host='localhost',
                    database='mimic-iii',
                    user='postgres',
                    password='PASS',
                    port=5432)
2. Create a function that receives an SQL query and automatically opens a cursor, queries the database, extracts the columns, creates a pandas dataframe, and closes the connection.
import pandas as pd

def query_database(query):
    conn = psql.connect(host='localhost',
                        database='mimic-iii',
                        user='postgres',
                        password='PASS',
                        port=5432)
    cur = conn.cursor()
    # Execute the query and fetch the results
    cur.execute(query)
    results = cur.fetchall()
    # Extract the column names from the cursor
    col_names = [desc[0] for desc in cur.description]
    # Create a pandas dataframe from the results
    df = pd.DataFrame(results, columns=col_names)
    # Close the cursor and connection
    cur.close()
    conn.close()
    return df
3. Query the table admissions filtering for admission type as emergency and insurance as private.
df = query_database("SELECT * FROM public.admissions WHERE admission_type='EMERGENCY' AND insurance='Private'")
df
|   | row_id | subject_id | hadm_id | admittime | dischtime | deathtime | admission_type | admission_location | discharge_location | insurance | language | religion | marital_status | ethnicity | edregtime | edouttime | diagnosis | hospital_expire_flag | has_chartevents_data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12263 | 10011 | 105331 | 2126-08-14 22:32:00 | 2126-08-28 18:59:00 | 2126-08-28 18:59:00 | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | DEAD/EXPIRED | Private | None | CATHOLIC | SINGLE | UNKNOWN/NOT SPECIFIED | NaT | NaT | HEPATITIS B | 1 | 1 |
| 1 | 12317 | 10067 | 160442 | 2130-10-06 01:34:00 | 2130-10-06 02:29:00 | 2130-10-06 02:29:00 | EMERGENCY | EMERGENCY ROOM ADMIT | DEAD/EXPIRED | Private | None | UNOBTAINABLE | None | OTHER | 2130-10-06 00:12:00 | 2130-10-06 01:30:00 | S/P MOTORCYCLE ACCIDENT | 1 | 1 |
| 2 | 12339 | 10088 | 149044 | 2107-05-12 18:00:00 | 2107-05-18 13:30:00 | NaT | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | SNF | Private | None | UNOBTAINABLE | UNKNOWN (DEFAULT) | WHITE | NaT | NaT | UROSEPSIS | 0 | 1 |
| 3 | 12341 | 10090 | 176805 | 2124-01-12 14:26:00 | 2124-01-14 19:00:00 | NaT | EMERGENCY | EMERGENCY ROOM ADMIT | DISCH-TRAN TO PSYCH HOSP | Private | ENGL | NOT SPECIFIED | SINGLE | WHITE | 2124-01-12 11:56:00 | 2124-01-12 16:09:00 | OVERDOSE | 0 | 1 |
| 4 | 12349 | 10098 | 180685 | 2170-12-02 23:24:00 | 2170-12-03 15:55:00 | 2170-12-03 15:55:00 | EMERGENCY | EMERGENCY ROOM ADMIT | DEAD/EXPIRED | Private | None | UNOBTAINABLE | None | OTHER | 2170-12-02 22:45:00 | 2170-12-03 00:56:00 | STATUS POST MOTOR VEHICLE ACCIDENT WITH INJURIES | 1 | 1 |
| 5 | 12357 | 10106 | 133283 | 2161-09-14 22:22:00 | 2161-09-19 17:00:00 | NaT | EMERGENCY | EMERGENCY ROOM ADMIT | HOME WITH HOME IV PROVIDR | Private | None | CATHOLIC | MARRIED | WHITE | 2161-09-14 14:45:00 | 2161-09-15 01:30:00 | HEADACHE | 0 | 1 |
| 6 | 12368 | 10117 | 187023 | 2138-06-05 17:23:00 | 2138-06-11 10:16:00 | NaT | EMERGENCY | EMERGENCY ROOM ADMIT | HOME | Private | None | CATHOLIC | SINGLE | UNKNOWN/NOT SPECIFIED | 2138-06-05 11:42:00 | 2138-06-05 21:20:00 | FEVER | 0 | 1 |
| 7 | 12369 | 10117 | 105150 | 2138-11-09 18:08:00 | 2138-11-18 23:13:00 | 2138-11-18 23:13:00 | EMERGENCY | EMERGENCY ROOM ADMIT | DEAD/EXPIRED | Private | None | CATHOLIC | SINGLE | UNKNOWN/NOT SPECIFIED | 2138-11-09 15:42:00 | 2138-11-09 20:42:00 | FEVER | 1 | 1 |
| 8 | 12381 | 10126 | 160445 | 2171-07-12 06:02:00 | 2171-08-16 12:00:00 | 2171-08-16 12:00:00 | EMERGENCY | PHYS REFERRAL/NORMAL DELI | DEAD/EXPIRED | Private | None | UNOBTAINABLE | SINGLE | WHITE | NaT | NaT | LIVER FAILURE | 1 | 1 |
| 9 | 12382 | 10127 | 182839 | 2198-06-28 05:34:00 | 2198-07-20 14:56:00 | NaT | EMERGENCY | EMERGENCY ROOM ADMIT | REHAB/DISTINCT PART HOSP | Private | None | NOT SPECIFIED | None | WHITE | 2198-06-28 04:28:00 | 2198-06-28 05:52:00 | S/P MOTOR VEHICLE ACCIDENT | 0 | 1 |
| 10 | 12385 | 10130 | 156668 | 2161-01-30 16:26:00 | 2161-02-19 14:05:00 | NaT | EMERGENCY | EMERGENCY ROOM ADMIT | HOME HEALTH CARE | Private | None | CATHOLIC | MARRIED | UNKNOWN/NOT SPECIFIED | 2161-01-30 10:35:00 | 2161-01-30 20:25:00 | ABSCESS | 0 | 1 |
| 11 | 39869 | 40310 | 186361 | 2144-07-11 15:02:00 | 2144-11-12 14:40:00 | NaT | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | REHAB/DISTINCT PART HOSP | Private | ENGL | CATHOLIC | SINGLE | WHITE | NaT | NaT | FACIAL NUMBNESS | 0 | 1 |
| 12 | 39962 | 40601 | 182879 | 2184-08-04 05:44:00 | 2184-08-10 15:30:00 | NaT | EMERGENCY | EMERGENCY ROOM ADMIT | SNF | Private | ENGL | PROTESTANT QUAKER | MARRIED | WHITE | 2184-08-04 01:35:00 | 2184-08-04 06:47:00 | SEPSIS | 0 | 1 |
| 13 | 40440 | 42066 | 171628 | 2112-02-04 14:49:00 | 2112-02-11 12:00:00 | 2112-02-11 12:00:00 | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | DEAD/EXPIRED | Private | ENGL | CATHOLIC | SINGLE | WHITE | NaT | NaT | TRACHEAL STENOSIS | 1 | 1 |
| 14 | 40512 | 42292 | 138503 | 2162-01-16 13:56:00 | 2162-01-19 13:45:00 | NaT | EMERGENCY | EMERGENCY ROOM ADMIT | HOME | Private | ENGL | CATHOLIC | SINGLE | WHITE | 2162-01-16 11:28:00 | 2162-01-16 16:12:00 | PNEUMONIA/HYPOGLCEMIA/SYNCOPE | 0 | 1 |
| 15 | 40993 | 43881 | 172454 | 2104-09-24 17:31:00 | 2104-09-30 16:17:00 | NaT | EMERGENCY | EMERGENCY ROOM ADMIT | HOME HEALTH CARE | Private | ENGL | NOT SPECIFIED | MARRIED | WHITE | 2104-09-24 12:07:00 | 2104-09-24 18:50:00 | ACUTE PULMONARY EMBOLISM | 0 | 1 |
| 16 | 40994 | 43881 | 167021 | 2104-10-24 09:44:00 | 2104-11-01 11:59:00 | NaT | EMERGENCY | EMERGENCY ROOM ADMIT | HOME | Private | ENGL | NOT SPECIFIED | MARRIED | WHITE | 2104-10-24 07:17:00 | 2104-10-24 11:10:00 | UPPER GI BLEED | 0 | 1 |
| 17 | 41054 | 44083 | 131048 | 2112-05-22 15:37:00 | 2112-05-25 13:30:00 | NaT | EMERGENCY | EMERGENCY ROOM ADMIT | HOME HEALTH CARE | Private | ENGL | CATHOLIC | SINGLE | WHITE | 2112-05-22 09:25:00 | 2112-05-22 17:04:00 | SHORTNESS OF BREATH | 0 | 1 |
| 18 | 41055 | 44083 | 198330 | 2112-05-28 15:45:00 | 2112-06-07 16:50:00 | NaT | EMERGENCY | EMERGENCY ROOM ADMIT | HOME | Private | ENGL | CATHOLIC | SINGLE | WHITE | 2112-05-28 13:16:00 | 2112-05-28 17:30:00 | PERICARDIAL EFFUSION | 0 | 1 |
| 19 | 41092 | 44228 | 103379 | 2170-12-15 03:14:00 | 2170-12-24 18:00:00 | NaT | EMERGENCY | EMERGENCY ROOM ADMIT | HOME HEALTH CARE | Private | ENGL | NOT SPECIFIED | SINGLE | WHITE | 2170-12-15 02:22:00 | 2170-12-15 05:25:00 | CHOLANGITIS | 0 | 1 |
4. Query the table admissions, filtering for the same conditions as the previous exercise (admission type as emergency and insurance as private). Join the drgcodes table on the admission ID. Display only the columns with the subject ID, admission ID, time of death, and description of the DRG code.
df = query_database("""
    SELECT admissions.subject_id,
           admissions.hadm_id,
           admissions.deathtime,
           drgcodes.description
    FROM admissions
    JOIN drgcodes
      ON admissions.hadm_id = drgcodes.hadm_id
    WHERE admissions.admission_type = 'EMERGENCY'
      AND admissions.insurance = 'Private'
""")
df
|   | subject_id | hadm_id | deathtime | description |
|---|---|---|---|---|
| 0 | 10130 | 156668 | NaT | MAJOR SMALL & LARGE BOWEL PROCEDURES WITH COMP... |
| 1 | 10117 | 187023 | NaT | DENTAL & ORAL DIS EXCEPT EXTRACTIONS & RESTORA... |
| 2 | 10011 | 105331 | 2126-08-28 18:59:00 | DISORDERS OF LIVER EXCEPT MALIGNANCY, CIRRHOSI... |
| 3 | 10117 | 105150 | 2138-11-18 23:13:00 | SIMPLE PNEUMONIA & PLEURISY AGE >17 WITH COMPL... |
| 4 | 10106 | 133283 | NaT | NERVOUS SYSTEM INFECTION EXCEPT VIRAL MENINGITIS |
| 5 | 10126 | 160445 | 2171-08-16 12:00:00 | Liver Transplant |
| 6 | 10126 | 160445 | 2171-08-16 12:00:00 | ECMO OR TRACHEOSTOMY WITH MECHANICAL VENTILATI... |
| 7 | 10088 | 149044 | NaT | SEPTICEMIA AGE >17 |
| 8 | 10098 | 180685 | 2170-12-03 15:55:00 | Craniotomy for Trauma |
| 9 | 10098 | 180685 | 2170-12-03 15:55:00 | CRANIOTOMY AGE >17 WITH COMPLICATIONS, COMORBI... |
| 10 | 10067 | 160442 | 2130-10-06 02:29:00 | OTHER OPERATING ROOM PROCEDURES FOR MULTIPLE S... |
| 11 | 10127 | 182839 | NaT | ECMO OR TRACHEOSTOMY WITH MECHANICAL VENTILATI... |
| 12 | 10127 | 182839 | NaT | Musculoskeletal & Other Procedures For Multipl... |
| 13 | 10090 | 176805 | NaT | POISONING & TOXIC EFFECTS OF DRUGS AGE >17 WIT... |
| 14 | 10090 | 176805 | NaT | Poisoning Of Medicinal Agents |
| 15 | 43881 | 172454 | NaT | Pulmonary Embolism |
| 16 | 43881 | 172454 | NaT | Pulmonary Embolism |
| 17 | 43881 | 172454 | NaT | PULMONARY EMBOLISM W MCC |
| 18 | 43881 | 167021 | NaT | Digestive Malignancy |
| 19 | 43881 | 167021 | NaT | Digestive Malignancy |
| 20 | 43881 | 167021 | NaT | OTHER DIGESTIVE SYSTEM O.R. PROCEDURES W MCC |
| 21 | 44228 | 103379 | NaT | Septicemia & Disseminated Infections |
| 22 | 44228 | 103379 | NaT | Septicemia & Disseminated Infections |
| 23 | 44228 | 103379 | NaT | SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W... |
| 24 | 42292 | 138503 | NaT | Contusion, Open Wound & Other Trauma To Skin &... |
| 25 | 42292 | 138503 | NaT | Contusion, Open Wound & Other Trauma To Skin &... |
| 26 | 42292 | 138503 | NaT | TRAUMA TO THE SKIN, SUBCUT TISS & BREAST W MCC |
| 27 | 44083 | 198330 | NaT | Other Circulatory System Procedures |
| 28 | 44083 | 198330 | NaT | Other Circulatory System Procedures |
| 29 | 44083 | 198330 | NaT | MAJOR CARDIOVASC PROCEDURES W/O MCC |
| 30 | 40310 | 186361 | NaT | Other O.R. Procedures For Lymphatic/Hematopiet... |
| 31 | 40310 | 186361 | NaT | Other O.R. Procedures For Lymphatic/Hematopiet... |
| 32 | 40310 | 186361 | NaT | LYMPHOMA & NON-ACUTE LEUKEMIA W OTHER O.R. PRO... |
| 33 | 44083 | 131048 | NaT | Other Respiratory Diagnosis Except Signs, Symp... |
| 34 | 44083 | 131048 | NaT | Other Respiratory Diagnosis Except Signs, Symp... |
| 35 | 44083 | 131048 | NaT | PLEURAL EFFUSION W CC |
| 36 | 40601 | 182879 | NaT | Acute Myocardial Infarction |
| 37 | 40601 | 182879 | NaT | Acute Myocardial Infarction |
| 38 | 40601 | 182879 | NaT | ACUTE MYOCARDIAL INFARCTION, DISCHARGED ALIVE ... |
| 39 | 42066 | 171628 | 2112-02-11 12:00:00 | Respiratory System Diagnosis w/ Ventilator Sup... |
| 40 | 42066 | 171628 | 2112-02-11 12:00:00 | Respiratory System Diagnosis w/ Ventilator Sup... |
| 41 | 42066 | 171628 | 2112-02-11 12:00:00 | RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPP... |
5.1. Obtain the dataset for this problem, by running the SQL query below.
query = """
    SELECT pivoted_lab.*,
           gender,
           admission_age,
           ethnicity_grouped AS eth_grp,
           hospital_expire_flag,
           los_icu
    FROM demographics
    LEFT JOIN pivoted_lab
      ON demographics.icustay_id = pivoted_lab.icustay_id
    WHERE first_icu_stay = true
"""
df = query_database(query)
df
|   | subject_id | hadm_id | icustay_id | aniongap_min | aniongap_max | albumin_min | albumin_max | bands_min | bands_max | bicarbonate_min | ... | sodium_max | bun_min | bun_max | wbc_min | wbc_max | gender | admission_age | eth_grp | hospital_expire_flag | los_icu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10006 | 142345 | 206504 | 12.0 | 20.0 | 2.7 | 3.4 | NaN | NaN | 29.0 | ... | 139.0 | 9.0 | 11.0 | 4.6 | 7.8 | F | 70.0 | black | 0 | 1.0 |
| 1 | 10011 | 105331 | 232110 | 12.0 | 12.0 | 2.6 | 2.6 | 2.0 | 2.0 | 23.0 | ... | 136.0 | 3.0 | 3.0 | 10.6 | 10.6 | F | 36.0 | unknown | 1 | 13.0 |
| 2 | 10013 | 165520 | 264446 | 13.0 | 13.0 | NaN | NaN | 13.0 | 13.0 | 29.0 | ... | 138.0 | 32.0 | 32.0 | 13.8 | 16.2 | F | 87.0 | unknown | 1 | 2.0 |
| 3 | 10017 | 199207 | 204881 | 13.0 | 13.0 | 2.8 | 2.8 | NaN | NaN | 29.0 | ... | 139.0 | 3.0 | 3.0 | 15.8 | 15.8 | F | 74.0 | white | 0 | 2.0 |
| 4 | 10019 | 177759 | 228977 | 20.0 | 46.0 | 3.2 | 3.2 | NaN | NaN | 10.0 | ... | 141.0 | 31.0 | 53.0 | 3.7 | 6.8 | M | 49.0 | white | 1 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 123 | 44083 | 198330 | 286428 | 16.0 | 16.0 | NaN | NaN | NaN | NaN | 21.0 | ... | 142.0 | 12.0 | 12.0 | 12.3 | 14.9 | M | 55.0 | white | 0 | 3.0 |
| 124 | 44154 | 174245 | 217724 | 15.0 | 15.0 | NaN | NaN | NaN | NaN | 19.0 | ... | 142.0 | 16.0 | 21.0 | 12.2 | 17.1 | M | 300.0 | white | 1 | 0.0 |
| 125 | 44212 | 163189 | 239396 | 15.0 | 21.0 | 2.9 | 3.0 | NaN | NaN | 18.0 | ... | 150.0 | 37.0 | 57.0 | 8.8 | 11.4 | F | 45.0 | black | 0 | 31.0 |
| 126 | 44222 | 192189 | 238186 | 11.0 | 15.0 | NaN | NaN | NaN | NaN | 22.0 | ... | 135.0 | 21.0 | 24.0 | 9.3 | 9.9 | M | 73.0 | white | 0 | 1.0 |
| 127 | 44228 | 103379 | 217992 | 12.0 | 18.0 | 2.2 | 2.7 | NaN | NaN | 15.0 | ... | 142.0 | 10.0 | 11.0 | 7.0 | 41.9 | F | 58.0 | white | 0 | 4.0 |
128 rows × 46 columns
5.2. Close the connection to your SQL server.
conn.close()
5.3. Prepare your dataset:
# Drop ID columns
df.drop(columns=['subject_id', 'hadm_id', 'icustay_id'], inplace=True)
# Drop columns with at least one NA value
df.dropna(axis=1, inplace=True)
# Encode categorical columns
df = pd.get_dummies(df, columns=['eth_grp', 'gender'])
# Set hospital_expire_flag as response and remaining columns as predictors
response = df['hospital_expire_flag']
predictors = df.drop(columns=['hospital_expire_flag'])
6. Fit the following tree-based classifiers to the dataset. For each method:
- Perform k-fold cross-validation to evaluate the models. Consider 10 folds.
- Plot the ROC curves for each fold, along with the mean ROC curve.
- Calculate the mean AUC.
a. Decision tree.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
import numpy as np

def KFoldCV_ROC(model):
    kfold = KFold(n_splits=10)
    scores = []
    tprs = []
    aucs = []
    mean_fpr = np.linspace(0, 1, 100)
    for train, test in kfold.split(predictors, response):
        model.fit(predictors.iloc[train], response.iloc[train])
        scores.append(model.score(predictors.iloc[test], response.iloc[test]))
        fpr, tpr, _ = roc_curve(response.iloc[test],
                                model.predict_proba(predictors.iloc[test])[:, 1])
        tprs.append(np.interp(mean_fpr, fpr, tpr))
        tprs[-1][0] = 0.0
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)
    # Chance line, per-fold curves, and the mean ROC curve
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2)
    for tpr_fold in tprs:
        plt.plot(mean_fpr, tpr_fold, lw=1, alpha=0.3)
    mean_tpr = np.mean(tprs, axis=0)
    mean_auc = auc(mean_fpr, mean_tpr)
    plt.plot(mean_fpr, mean_tpr, lw=2.5)
    plt.title('ROC Curves')
    plt.show()
    print("Mean AUC: " + str(np.mean(aucs)))
model = DecisionTreeClassifier()
KFoldCV_ROC(model)
Mean AUC: 0.6038455988455989
b. Random forest
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
KFoldCV_ROC(model)
Mean AUC: 0.7204870129870129
c. Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
KFoldCV_ROC(model)
Mean AUC: 0.7416666666666667
7.1. Perform a grid search cross-validation on the Gradient boosting methods, changing the value of the learning rate (0.01 to 0.5) and the number of estimators (50-500). Consider the mean AUC of the folds as the performance measure.
from sklearn.model_selection import GridSearchCV, KFold

parameters = {'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.5],
              'n_estimators': [50, 75, 100, 250, 500]}
# kfold is local to KFoldCV_ROC, so define the CV splitter here as well
grid_search = GridSearchCV(model, parameters, cv=KFold(n_splits=10), scoring='roc_auc')
grid_search.fit(predictors, response)
print("Best parameters:")
print(grid_search.best_params_)
print("Mean AUC score:")
print(grid_search.best_score_)
Best parameters:
{'learning_rate': 0.1, 'n_estimators': 75}
Mean AUC score:
0.7608730158730159
7.2. Plot a scatterplot of the learning rate versus the number of estimators, with the mean AUC as the color gradient.
scores = grid_search.cv_results_['mean_test_score']
learning_rates = grid_search.cv_results_['param_learning_rate'].data
n_estimators = grid_search.cv_results_['param_n_estimators'].data

plt.scatter(x=n_estimators, y=learning_rates, c=scores, s=100)
plt.xlabel('Number of estimators')
plt.ylabel('Learning rate')
plt.colorbar(label='Mean AUC')
plt.show()
8.1. Perform forward stepwise selection on the dataset. Use the best parameters of the gradient boosting method obtained in 7.1..
from sklearn.feature_selection import SequentialFeatureSelector
selector = SequentialFeatureSelector(estimator=grid_search.best_estimator_,
direction="forward",
scoring="roc_auc",
n_features_to_select=10)
selector.fit(predictors, response)
print("Features:")
print(list(predictors.columns[selector.support_]))
Features:
['sodium_min', 'wbc_max', 'admission_age', 'eth_grp_asian', 'eth_grp_black', 'eth_grp_native', 'eth_grp_other', 'eth_grp_white', 'gender_F', 'gender_M']
8.2. Compare and comment on the results from 8.1. with the feature importances obtained through the grid search of question 7.1..
# Indices sorted by decreasing importance, so the slice keeps the 10 most important
sorted_importances_ind = np.argsort(grid_search.best_estimator_.feature_importances_)[::-1]
print("Features from the Grid Search procedure:")
print(list(predictors.columns[sorted_importances_ind[0:10]]))
print()
print("Features from the forward stepwise selection procedure:")
print(list(predictors.columns[selector.get_support(indices=True)]))
Features from the Grid Search procedure:
['eth_grp_asian', 'eth_grp_black', 'eth_grp_native', 'eth_grp_unknown', 'eth_grp_other', 'gender_M', 'creatinine_min', 'gender_F', 'eth_grp_hispanic', 'potassium_max']

Features from the forward stepwise selection procedure:
['sodium_min', 'wbc_max', 'admission_age', 'eth_grp_asian', 'eth_grp_black', 'eth_grp_native', 'eth_grp_other', 'eth_grp_white', 'gender_F', 'gender_M']
We can see that the two methods select different features. This may be because forward stepwise selection is a greedy algorithm: it makes decisions based on the current state of the model without considering later steps, so it may not find the optimal combination of predictors.
1. Consider a dataset where best subset, forward stepwise and backward stepwise selection will be performed. For each of the 3 approaches, we obtain $p+1$ models, $p$ being the total number of predictors. This means that each approach has a model with 0 predictors, one with 1 predictor, one with 2 predictors, up to one model with $p$ predictors. Answer and justify the following questions:
a) For each $k \in \{0, \dots, p\}$, which of the three models with $k$ predictors has the smallest training RSS?
Best subset selection evaluates every possible combination of predictors and keeps the best model of each size. Forward stepwise selection starts with an empty model and iteratively adds one predictor at a time until all predictors are included. Backward stepwise selection starts with the full model and iteratively removes one predictor at a time until no predictors remain.
The model obtained with the best subset approach has the smallest training RSS, although forward and backward stepwise selection may happen to choose an equally good model. Best subset selection searches all possible models with $k$ predictors and picks the one with the smallest training RSS, whereas the other two methods only explore a subset: forward stepwise fixes the best $(k-1)$-predictor model and adds the best remaining predictor, and backward stepwise does the reverse.
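To illustrate this with a toy sketch (synthetic data, not part of the assignment): if $y$ is exactly $x_1 + x_2$ and a third predictor $x_3$ is a noisy copy of $y$, forward stepwise selection greedily picks $x_3$ first, and its 2-variable model cannot match the exhaustive search, which finds the perfect pair $\{x_1, x_2\}$:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + x2                              # exact 2-predictor relationship
x3 = y + rng.normal(scale=0.1, size=n)   # noisy copy of y: best single predictor
X = np.column_stack([x1, x2, x3])

def rss(cols):
    """Training RSS of least squares on the given predictor columns."""
    A = X[:, list(cols)]
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

# Forward stepwise: greedily add the predictor that lowers RSS the most
selected = []
for _ in range(2):
    remaining = [j for j in range(3) if j not in selected]
    selected.append(min(remaining, key=lambda j: rss(selected + [j])))

# Best subset of size 2: exhaustive search over all pairs
best_pair = min(itertools.combinations(range(3), 2), key=rss)

print("forward picks:", selected)      # starts with x3 (index 2)
print("best subset:", best_pair)       # (0, 1), i.e. {x1, x2}, RSS ~ 0
print(rss(best_pair) < rss(selected))  # True: the greedy path is suboptimal
```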
b) For each $k \in \{0, \dots, p\}$, which of the three models with $k$ predictors has the smallest test RSS?
There is no way to tell in general. Best subset selection achieves the smallest training RSS, but the model with the smallest test RSS depends on the data: since best subset considers more models, it is more prone to overfitting, and the stepwise approaches may end up choosing a model that fits the test data better.
c) Evaluate the following statements with true or false. Justify your answers.
i. The predictors in the k-variable model identified by forward stepwise selection are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
True. The $(k+1)$-variable model is the $k$-variable model with one more predictor added, so the $k$-variable model's predictors are always a subset of it.
ii. The predictors in the k-variable model identified by backward stepwise selection are a subset of the predictors in the (k + 1)-variable model identified by backward stepwise selection.
True. The $k$-variable model is the $(k+1)$-variable model with one predictor removed, so its predictors are always a subset of those in the $(k+1)$-variable model.
iii. The predictors in the k-variable model identified by backward stepwise selection are a subset of the predictors in the (k + 1)-variable model identified by forward stepwise selection.
False. Forward stepwise starts with an empty model and iteratively adds predictors, while backward stepwise starts with all predictors and iteratively removes them, so the two paths are generally not the same. Thus, the predictors in the $k$-variable model identified by backward stepwise are usually not a subset of the predictors in the $(k+1)$-variable model identified by forward stepwise, although this can happen in particular cases.
iv. The predictors in the k-variable model identified by forward stepwise selection are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
False. By the same argument: since the forward and backward paths are generally different, the predictors in the $k$-variable model identified by forward stepwise are usually not a subset of the predictors in the $(k+1)$-variable model identified by backward stepwise, although this can happen in particular cases.
v. The predictors in the k-variable model identified by best subset selection are a subset of the predictors in the (k + 1)-variable model identified by best subset selection.
False. Best subset selection runs an exhaustive search separately for each model size, so there is no guarantee that the best $(k+1)$-variable model contains all the predictors of the best $k$-variable model. For example, the best 1-variable model might use $X_3$ while the best 2-variable model uses $\{X_1, X_2\}$.
2. Ridge regression tends to give similar coefficient values to correlated variables, whereas lasso regression may give substantially different coefficients to correlated variables. This questions explores this property in a simplified setting.
Suppose that $n=2$, $p=2$, $x_{11} = x_{12}$, $x_{21} = x_{22}$. Moreover, suppose that $y_1 + y_2 = 0$ and $x_{11} + x_{21} = 0$ and $x_{12} + x_{22} = 0$, meaning that the estimate for the intercept in a least squares, ridge regression, or lasso regression is zero: $\hat{\beta}_0 = 0$.
a) Write the ridge regression optimization problem in this setting.
b) Prove that in this setting, the ridge regression coefficient estimates satisfy $\hat{\beta}_1 = \hat{\beta}_2$.
c) Write the lasso regression optimization problem in this setting.
d) Prove that in this setting, the lasso regression coefficients $\hat{\beta}_1$ and $\hat{\beta}_2$ are not unique, meaning that there are many possible solutions to the optimization problem in (c). Describe these solutions.
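A possible sketch of the argument for (a)-(d), under the stated symmetry assumptions (writing $x_{i1} = x_{i2} = x_i$):

```latex
% (a) Ridge: with identical columns, the fit depends only on s = \beta_1 + \beta_2
\min_{\beta_1,\beta_2}\;
\sum_{i=1}^{2}\bigl(y_i - (\beta_1+\beta_2)\,x_i\bigr)^2
+ \lambda\,(\beta_1^2+\beta_2^2)

% (b) For fixed s, the penalty \beta_1^2+\beta_2^2 subject to \beta_1+\beta_2=s
% is uniquely minimized at \beta_1=\beta_2=s/2, hence \hat\beta_1=\hat\beta_2.

% (c) Lasso:
\min_{\beta_1,\beta_2}\;
\sum_{i=1}^{2}\bigl(y_i - (\beta_1+\beta_2)\,x_i\bigr)^2
+ \lambda\,(|\beta_1|+|\beta_2|)

% (d) For fixed s, |\beta_1|+|\beta_2| \ge |s|, with equality for every pair
% satisfying \beta_1+\beta_2=s and \beta_1\beta_2 \ge 0; all such pairs attain
% the same objective, so the lasso solution is a whole line segment of
% coefficient pairs rather than a single point.
```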
3. Draw an example of a partition of two-dimensional feature space that could result from recursive binary splitting. Your example should contain at least six regions. Draw a decision tree corresponding to this partition. Be sure to label all aspects of your figures, including the regions R1, R2,..., the cutpoints t1, t2,..., and so forth.
If you prefer you can draw it by hand or in any software and use a scan of it.
4. In 2 dimensions, a linear decision boundary takes the form $\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0$. Consider a non-linear decision boundary:
a) Sketch the curve
$$(1 + X_1)^2 + (2 - X_2)^2 = 4$$

Additionally, indicate on your sketch the set of points that verify the condition

$$(1 + X_1)^2 + (2 - X_2)^2 > 4$$

and the condition

$$(1 + X_1)^2 + (2 - X_2)^2 \leq 4$$

import matplotlib.pyplot as plt
import numpy as np
X1 = np.linspace(-5, 5, 100)
X2 = np.linspace(-2.5, 7.5, 100)
X1, X2 = np.meshgrid(X1, X2)

def f(x1, x2):
    return (1 + x1)**2 + (2 - x2)**2

Y = f(X1, X2)  # f works element-wise on arrays, so no vectorize is needed
plt.contour(X1, X2, Y, levels=[4])
plt.xlabel('X1')
plt.ylabel('X2')
plt.title('Contour plot of (1+X1)^2 + (2-X2)^2 = 4')
plt.annotate('< 4', xy=(-1, 2))
plt.annotate('> 4', xy=(3, 2))
plt.show()
b) Suppose that a classifier assigns an observation to the blue class if $(1 + X_1)^2 + (2 - X_2)^2 > 4$ and to the red class otherwise. To what class are the following observations classified? (0,0), (-1,1), (2,2), (3,8)
points = [(0, 0), (-1, 1), (2, 2), (3, 8)]
for point in points:
    # Blue class if (1 + X1)^2 + (2 - X2)^2 > 4, red otherwise
    if f(point[0], point[1]) > 4:
        print(point, ': BLUE')
    else:
        print(point, ': RED')
(0, 0) : BLUE
(-1, 1) : RED
(2, 2) : BLUE
(3, 8) : BLUE
c) Prove that while the decision boundary in (b) is not linear in terms of $X_1$ and $X_2$, it is linear in terms of $X_1$, $X_1^2$, $X_2$, and $X_2^2$.
To prove this, we want to find $a, b, c, d, e$ such that the boundary can be written as $a X_1 + b X_1^2 + c X_2 + d X_2^2 + e \leq 0$.

Expanding the given inequality: $(1 + X_1)^2 + (2 - X_2)^2 \leq 4 \iff X_1^2 + 2 X_1 + X_2^2 - 4 X_2 + 1 \leq 0$

So $a = 2$, $b = 1$, $c = -4$, $d = 1$, $e = 1$: the boundary is linear in $X_1$, $X_1^2$, $X_2$, and $X_2^2$.
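A quick numeric check of the expansion (re-declaring f here so the cell is self-contained):

```python
# Verify that (1 + x1)^2 + (2 - x2)^2 - 4 equals the expanded linear-in-features
# form 1*x1^2 + 2*x1 + 1*x2^2 - 4*x2 + 1 on a few sample points.
def f(x1, x2):
    return (1 + x1)**2 + (2 - x2)**2

def expanded(x1, x2):
    return x1**2 + 2*x1 + x2**2 - 4*x2 + 1

for x1, x2 in [(0, 0), (-1, 1), (2, 2), (3, 8)]:
    assert abs((f(x1, x2) - 4) - expanded(x1, x2)) < 1e-12
print("expansion verified")
```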
What are the advantages and disadvantages of relational databases versus graph databases, and when should one type be preferred over the other?
The main difference between these two types of databases is how they store relationships between entities.
In a relational database, data is organized into tables made up of rows and columns. Each row, or record, contains all the information about a specific item, and each column represents a specific piece of information about that item.
Graph databases are designed for storing and processing data with intricate relationships between entities. Rather than using tables to organize data, graph databases represent data using nodes and edges, which depict the connections between data points.
Relational databases are faster when a significant number of simple records is stored, and they tend to have a smaller memory footprint. Graph databases, on the other hand, are better suited for relationship-heavy use cases. They are particularly useful for analyzing intricate data sets, and are therefore a better choice when complex relationships have to be traversed, as in recommendation engines, fraud detection, or social media.
In short:
- Relational databases should be preferred when the data is structured and fits naturally into tables, and when a large number of simple records must be stored and queried efficiently.
- Graph databases should be preferred when the data is highly connected and the queries involve traversing relationships, such as in recommendation engines, fraud detection, or social networks.
For example, when building a social network, the profile information of a user, such as age or sex, could be stored in a relational database. For the information about which users are friends with which, a graph database could be more suitable, since an edge between two user nodes can represent a friendship, instead of storing all the connections for all users in a table structure as in a relational database.
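A toy sketch of the friendship example (hypothetical data, plain Python in place of an actual graph database) showing why traversal is natural when friendships are edges:

```python
# Hypothetical social graph: nodes are user names, edges are friendships,
# stored as an adjacency list instead of a relational join table.
friends = {
    'ana':   {'bruno', 'carla'},
    'bruno': {'ana', 'diogo'},
    'carla': {'ana'},
    'diogo': {'bruno'},
}

def friends_of_friends(user):
    """Users reachable in exactly two hops, excluding the user and direct friends."""
    result = set()
    for friend in friends[user]:
        result |= friends[friend]
    return result - friends[user] - {user}

print(friends_of_friends('ana'))  # {'diogo'}
```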